Exploration of Prosper data

Gary Ng

Introduction

Peer-to-peer (P2P) lending may not be a familiar concept to most, but online marketplaces that connect individual borrowers with individual investors (i.e. people like you and me, not ‘Big Banks’) have existed in the United States for nearly a decade and have seen considerable growth.

In this study, I will be exploring a rich dataset provided by Prosper, the first P2P lending marketplace, founded in 2005. Prosper is a two-sided marketplace where loan requests from borrower members are listed for investor members to fund; each loan may be funded by more than one investor, and investor members are encouraged to diversify their risk by partially funding more than a hundred loans.

This study is meant as an exploratory data analysis aimed at developing an initial understanding of Prosper’s business. Specifically, we are interested in answering three broad questions:

  1. What are the characteristics of Prosper’s borrower members?
  2. What is the nature of the loans originated?
  3. How has Prosper’s business grown over the past years?

Data Set

The dataset contains all 113,937 records of loans listed on Prosper’s marketplace since the company’s inception. There are 81 variables in the original dataset, including demographic and credit information about borrowers, details and statuses of the loans, and much more. We will work with a subset of 33 variables, listed below, that are of interest for our research questions.

## 'data.frame':    113937 obs. of  33 variables:
##  $ ListingKey                 : Factor w/ 113066 levels "00003546482094282EF90E5",..: 7180 7193 6647 6669 6686 6689 6699 6706 6687 6687 ...
##  $ ListingCreationDate        : Factor w/ 113064 levels "2005-11-09 20:44:28.847000000",..: 14184 111894 6429 64760 85967 100310 72556 74019 97834 97834 ...
##  $ Term                       : int  36 36 36 36 36 60 36 36 36 36 ...
##  $ LoanStatus                 : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
##  $ BorrowerRate               : num  0.158 0.092 0.275 0.0974 0.2085 ...
##  $ EstimatedEffectiveYield    : num  NA 0.0796 NA 0.0849 0.1832 ...
##  $ EstimatedLoss              : num  NA 0.0249 NA 0.0249 0.0925 ...
##  $ EstimatedReturn            : num  NA 0.0547 NA 0.06 0.0907 ...
##  $ ProsperRating..numeric.    : int  NA 6 NA 6 3 5 2 4 7 7 ...
##  $ ProsperRating..Alpha.      : Factor w/ 8 levels "","A","AA","B",..: 1 2 1 2 6 4 7 5 3 3 ...
##  $ ProsperScore               : num  NA 7 NA 9 4 10 2 4 9 11 ...
##  $ ListingCategory..numeric.  : int  0 2 0 16 2 1 1 2 7 7 ...
##  $ BorrowerState              : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
##  $ Occupation                 : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
##  $ EmploymentStatus           : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
##  $ CreditScoreRangeLower      : int  640 680 480 800 680 740 680 700 820 820 ...
##  $ CreditScoreRangeUpper      : int  659 699 499 819 699 759 699 719 839 839 ...
##  $ RevolvingCreditBalance     : num  0 3989 NA 1444 6193 ...
##  $ BankcardUtilization        : num  0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
##  $ DebtToIncomeRatio          : num  0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
##  $ IncomeRange                : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
##  $ IncomeVerifiable           : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
##  $ StatedMonthlyIncome        : num  3083 6125 2083 2875 9583 ...
##  $ ProsperPrincipalBorrowed   : num  NA NA NA NA 11000 NA NA NA NA NA ...
##  $ LoanNumber                 : int  19141 134815 6466 77296 102670 123257 88353 90051 121268 121268 ...
##  $ LoanOriginalAmount         : int  9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
##  $ LoanOriginationDate        : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 426 1866 260 1535 1757 1821 1649 1666 1813 1813 ...
##  $ LoanOriginationQuarter     : Factor w/ 33 levels "Q1 2006","Q1 2007",..: 18 8 2 32 24 33 16 16 33 33 ...
##  $ MemberKey                  : Factor w/ 90831 levels "00003397697413387CAF966",..: 11071 10302 33781 54939 19465 48037 60448 40951 26129 26129 ...
##  $ PercentFunded              : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ InvestmentFromFriendsCount : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ InvestmentFromFriendsAmount: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Investors                  : int  258 1 41 158 20 1 1 1 1 1 ...

Let us first clarify what each record represents. According to the documentation, each record is a listing created by a borrower seeking a loan. Only when the listing receives sufficient bids from lenders to reach the amount requested will it become a loan.

One would expect at least some listings to be unfunded. However, every listing in the dataset had a valid LoanNumber and LoanOriginationDate, suggesting all of them were eventually funded. The natural assumption is that Udacity pre-processed the data and removed unfunded listings, or that Prosper did not provide them in the first place. Regardless, this allows us to simply treat each record as a loan.

However, when inspecting these variables, I noticed that multiple records shared the same LoanNumber, which seems odd.

## Source: local data frame [6 x 2]
## 
##   LoanNumber     n
##        (int) (int)
## 1     126059     6
## 2     118017     4
## 3     122984     4
## 4     127679     4
## 5     133965     4
## 6     103842     3

One theory could be that the borrower reposted the same request after his previous one expired without getting funded, and when the request finally got approved, all related listings might have been populated with the same LoanNumber. However, inspection of one such instance (#126059) suggests this is not the case. All six records associated with this LoanNumber had the same ListingKey, ListingCreationDate, and EstimatedEffectiveYield. In fact, every field was identical with the sole exception of ProsperScore, a custom risk score that Prosper built using historical data. There does not appear to be a justification for these duplicate records. Fortunately, only approximately 1.5% of the records were affected. For the purpose of our analysis, we will remove these duplicates and assume the integrity of the remaining dataset is intact. This leaves us with 112,239 records.

## [1] 112239
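
The report does not show the code behind this step. The following is a minimal base-R sketch of how repeated LoanNumber values might be counted and dropped, using a made-up `loans` data frame. Keeping the first copy of each loan is an assumption, since the text does not say which copy was kept.

```r
# Toy stand-in for the Prosper data; the values are made up for illustration.
loans <- data.frame(
  LoanNumber   = c(101L, 102L, 102L, 103L, 103L, 103L),
  ListingKey   = c("a", "b", "b", "c", "c", "c"),
  ProsperScore = c(7, 4, 5, 9, 10, 11),
  stringsAsFactors = FALSE
)

# Count records per LoanNumber and list the ones that occur more than once.
counts <- table(loans$LoanNumber)
repeated <- sort(counts[counts > 1], decreasing = TRUE)
print(repeated)

# Assumption: keep the first record for each LoanNumber and drop the rest.
deduped <- loans[!duplicated(loans$LoanNumber), ]
nrow(deduped)
```

On the real data, `nrow(deduped)` would be compared against the record count reported above; a different result would indicate that a different de-duplication rule was used.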

Now that we have established each record as a loan, let us explore the relationship between loans and borrowers. The documentation mentions that “a borrower may only have one active listing at a particular moment in time”, but a borrower could well have borrowed more than once over the span of ten years considered. The table and histogram below show the distribution of loans per member. Of 90,161 unique borrowers, a vast majority of ~83% had only 1 loan, ~12% had 2 loans, and none had more than 9 loans.
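
A loans-per-member distribution of this kind can be tabulated in two steps: count loans per MemberKey, then count how many members fall at each loan count. The sketch below uses hypothetical keys rather than the real column.

```r
# Hypothetical MemberKey values, one entry per loan.
member_key <- c("m1", "m2", "m2", "m3", "m4", "m4", "m4", "m5")

loans_per_member <- table(member_key)      # number of loans for each borrower
distribution <- table(loans_per_member)    # borrowers with 1, 2, 3, ... loans
share <- prop.table(distribution)          # the same counts as proportions

print(distribution)
print(round(share, 2))
```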

The relationship between borrowers and loans is important because we are interested in understanding Prosper’s borrower profile. If the distribution were highly skewed (i.e. a few borrowers having many loans), then it might be worth collapsing multiple records of the same borrower into a single record to avoid “distorting” the customer profile. Fortunately, this is not critical for our dataset (at least in the exploratory phase) since the vast majority of borrowers had only one loan and none had disproportionately many. Moreover, one could argue that members with multiple loans deserve greater weight in the borrower profile. Regardless, we will assume that each record represents a listing, a loan, and a borrower for the sake of simplicity.

We are now ready to dive into our analysis proper (pun intended)!

Analysis | Part 1: Borrowers

Firstly, I am curious about the borrowers. Are they mostly from the West Coast, where Prosper is based? What do they do for a living? Are they white-collar yuppies with good income looking to ‘leverage up’ their lifestyles, or blue-collar workers borrowing to make ends meet? How risky are these borrowers? We will explore these questions in this section.

Where are they from?

The map below shows that Prosper does have the highest concentration of users in its home state of California, but there are also significant swaths of users in New York, Texas, Florida, and Illinois - also the largest states by population. We will dig into population-adjusted user count in a later section, but initial evidence suggests that Prosper’s borrower base may be more geographically diverse than I had anticipated.

What do they do for work?

Prosper tracked 68 different categories for Occupation, which offers more granularity than we need. For instance, “Student” is subdivided into “Student - College Freshman”, “Student - College Sophomore”, “Student - College Junior”, etc. The same applies to “Tradesmen”, “Engineer”, and “Food Service”. I collapsed these categories and examined the 30 most common occupation groups.
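
One way such a collapse might be done is to strip the sub-category suffix with a regular expression. This sketch assumes sub-categories are marked with " - ", as in the Student examples; labels that follow another pattern would need their own rule, and the sample values are illustrative only.

```r
# Illustrative Occupation labels, not the full set of 68 levels.
occupation <- c("Student - College Freshman", "Student - College Junior",
                "Engineer - Mechanical", "Professional",
                "Engineer - Electrical")

# Drop everything from " - " onward so sub-categories share one label.
occupation_group <- sub(" - .*$", "", occupation)

# Most common groups first; on the real data head() keeps the top 30.
top_groups <- head(sort(table(occupation_group), decreasing = TRUE), 30)
print(top_groups)
```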

The chart below shows a good mix of traditionally “high-income jobs” such as computer programmer and engineer, vs. traditionally “low-income jobs” such as construction and laborer. In fact, high-income jobs like “Sales”, “Computer Programmer”, and “Executive” rank amongst the top 5 occupations. This suggests that Prosper appeals to a broad user base and is not simply regarded as an alternative to pay-day loan companies helping lower-income borrowers make ends meet.

(Note: The occupations’ income types were assigned at the author’s discretion)

With a median of $56,000, the income distribution of Prosper borrowers roughly mirrors that of the general U.S. population, corroborating our belief that they come from all walks of life. In fact, the borrowers may even skew toward high-income workers, as more than 15% of them have incomes in excess of $100,000. The skeptical reader might wonder if these numbers were inflated. For what it is worth, the dataset contains a variable called “IncomeVerifiable” indicating whether borrowers provided documentation to support their stated incomes. Filtering for that does not change the income distribution appreciably, which gives us some comfort about the veracity of the data.

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##        0    38200    56000    67230    81790 21000000
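
The dataset stores StatedMonthlyIncome, so the annual figures above were presumably obtained by multiplying by twelve; that conversion is an assumption here. The sketch below uses made-up rows to show the income summary, the same summary restricted to verifiable incomes, and the share above $100,000.

```r
# Made-up rows; StatedMonthlyIncome and IncomeVerifiable are dataset columns.
loans <- data.frame(
  StatedMonthlyIncome = c(3083, 6125, 2083, 2875, 9583, 4000),
  IncomeVerifiable    = c("True", "True", "False", "True", "True", "False"),
  stringsAsFactors = FALSE
)

# Assumption: annual income is twelve times the stated monthly income.
loans$AnnualIncome <- loans$StatedMonthlyIncome * 12

print(summary(loans$AnnualIncome))
# Same summary, restricted to borrowers whose income is verifiable.
print(summary(loans$AnnualIncome[loans$IncomeVerifiable == "True"]))

# Share of borrowers stating more than $100,000 a year.
high_income_share <- mean(loans$AnnualIncome > 100000)
```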

How risky are they?

Perhaps the most important question to ask about the borrowers is: how risky are they? Are they over-leveraged borrowers seeking loans through Prosper because they couldn’t fulfill their financing needs through traditional means? To answer this question, I looked at four key credit metrics: RevolvingCreditBalance, BankcardUtilization, DebtToIncomeRatio, and CreditScore. (Note: CreditScore was derived as the midpoint of CreditScoreRangeUpper and CreditScoreRangeLower that Prosper provided)
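
The midpoint derivation is simple arithmetic. The sketch below uses the first few range values from the structure listing above. The round values in the summary further down (for example a median of 690) suggest the result was then rounded, but that step is not stated, so it is left out here.

```r
# First rows of the two range columns from the structure listing above.
lower <- c(640, 680, 480, 800, 680)
upper <- c(659, 699, 499, 819, 699)

# CreditScore as the midpoint of the reported range.
credit_score <- (lower + upper) / 2
print(credit_score)
```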

Firstly, let’s get a quick sense of the data by looking at the scatterplot matrix. Note that there are too many data points to make for a meaningful scatterplot, so I show a random sample of 1,000 loans originated in 2014 instead. One observation is that these variables are not as correlated as we might have expected. With the exception of the fairly strong negative correlation of -0.45 between CreditScore and BankcardUtilization, the relationships between other variables are quite weak. This perhaps implies that each metric adds incremental information about the borrower, in turn justifying why Prosper and other lending firms incorporate many credit variables in their risk models.
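
The sampling and correlation step might look like the sketch below. It runs on simulated stand-in data (only the column names come from the dataset), so its numbers will not match the report, and the filter to 2014 loans is omitted.

```r
set.seed(42)  # fixed seed so the sample is reproducible

# Simulated stand-in data with a built-in negative relationship.
n <- 5000
utilization <- runif(n)
loans <- data.frame(
  BankcardUtilization = utilization,
  CreditScore         = 760 - 120 * utilization + rnorm(n, sd = 40),
  DebtToIncomeRatio   = rexp(n, rate = 4)
)

# Draw 1000 rows for plotting, then compute pairwise correlations,
# skipping missing values pair by pair.
sampled <- loans[sample(nrow(loans), 1000), ]
cors <- cor(sampled, use = "pairwise.complete.obs")
print(round(cors, 2))

# pairs(sampled) would draw a base-R scatterplot matrix of the sample.
```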

Next, let’s take a look at the summary statistics of the full dataset for these variables. The median borrower has ~$8,500 in RevolvingCreditBalance, 60% BankcardUtilization, a DebtToIncomeRatio of 0.22, and a CreditScore of 690. Sounds like a pretty credit-worthy borrower profile to me.

##  RevolvingCreditBalance BankcardUtilization DebtToIncomeRatio
##  Min.   :      0        Min.   :0.000       Min.   : 0.000   
##  1st Qu.:   3076        1st Qu.:0.300       1st Qu.: 0.140   
##  Median :   8514        Median :0.600       Median : 0.220   
##  Mean   :  17586        Mean   :0.561       Mean   : 0.276   
##  3rd Qu.:  19505        3rd Qu.:0.840       3rd Qu.: 0.320   
##  Max.   :1435667        Max.   :5.950       Max.   :10.010   
##  NA's   :7604           NA's   :7604        NA's   :8393     
##   CreditScore   
##  Min.   : 10.0  
##  1st Qu.:670.0  
##  Median :690.0  
##  Mean   :695.5  
##  3rd Qu.:730.0  
##  Max.   :890.0  
##  NA's   :591

Now, let’s examine the distribution of these variables. I should first caveat that RevolvingCreditBalance and DebtToIncomeRatio have very long tails, so for presentation purposes they are only plotted up to their 99th percentiles of $151K and 0.87 respectively. In addition, BankcardUtilization is capped at the theoretical maximum of 1.0.
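
A sketch of the two trimming steps, on made-up values. Whether the author dropped or clipped values above each limit is not stated; here the long-tailed variable is filtered at its 99th percentile and utilization is clipped with `pmin()`.

```r
# Made-up balances with one extreme value, and made-up utilization ratios.
balance <- c(0, 1200, 3000, 8500, 19500, 40000, 1435667)
utilization <- c(0, 0.21, 0.6, 0.84, 1.3, 5.95)

# Keep only values at or below the 99th percentile for plotting.
cutoff <- quantile(balance, 0.99, na.rm = TRUE)
balance_plot <- balance[balance <= cutoff]

# Clip utilization at 1.0 rather than dropping the rows (an assumption).
utilization_capped <- pmin(utilization, 1)
print(utilization_capped)
```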

The charts below are consistent with our initial assessment that the borrowers’ credit profile is pretty strong. For instance, the vast majority of borrowers carry debt less than 50% of their income, and nearly a quarter have credit scores at or above 740, commonly the benchmark for “excellent credit”. Perhaps the one area of potential concern is BankcardUtilization, which seems a tad high with nearly 30% of borrowers having 80% utilization or higher. Yet, it explains why these borrowers may be seeking alternative sources of funding in the first place.

Thus far we have only explored the data set in aggregate, but we should be cognizant that these data points were amassed over a span of ten years, and the borrower profile could have changed drastically over time. For good measure, we will analyze the distribution of credit scores by year of loan origination, as shown in the following chart. It appears that borrowers’ CreditScores have actually improved over time. At the company’s inception, we see that the median CreditScore (blue line) of borrowers was decidedly below 700 (black dotted line). However, in recent years it appears that the majority of borrowers have a CreditScore above 700.

However, we should not jump to the hasty conclusion that Prosper borrowers’ credit profile has been improving continuously. Closer scrutiny of the mean and median CreditScore suggests that there was indeed a huge improvement between 2006 and 2009, when the median improved 100 points from 610 to 710; however, the median CreditScore has held steady at 710 since then. Moreover, the mean has actually declined slightly from its peak of 715 in 2009 to 704 in 2014. Not that this is necessarily a bad thing: while Prosper should keep out the riskiest borrowers (which it does, by requiring a minimum credit score of 640), it may not be in Prosper’s interest to raise the bar so high that it keeps other potential borrowers out of its marketplace.

## Source: local data frame [9 x 4]
## 
##   LoanOriginationYear CreditScoreMean CreditScoreMedian     n
##                 (chr)           (dbl)             (dbl) (int)
## 1                2006        609.4941               610  5337
## 2                2007        654.3368               650 11460
## 3                2008        674.4200               670 11552
## 4                2009        715.2760               710  2047
## 5                2010        714.8549               710  5652
## 6                2011        709.6420               710 11228
## 7                2012        711.7215               710 19553
## 8                2013        709.0607               710 33492
## 9                2014        703.6629               710 11327
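
The table's header suggests it was produced with dplyr, but the code is not shown. The same per-year summary can be sketched in base R as follows, on made-up rows and taking the year from the first four characters of LoanOriginationDate.

```r
# Made-up rows; the column names match the dataset.
loans <- data.frame(
  LoanOriginationDate = c("2006-03-01 00:00:00", "2006-09-15 00:00:00",
                          "2009-07-20 00:00:00", "2009-11-02 00:00:00",
                          "2009-12-24 00:00:00"),
  CreditScore = c(590, 630, 700, 710, 750),
  stringsAsFactors = FALSE
)

# The year is the first four characters of the timestamp string.
loans$LoanOriginationYear <- substr(loans$LoanOriginationDate, 1, 4)

by_year <- aggregate(
  CreditScore ~ LoanOriginationYear, data = loans,
  FUN = function(x) c(mean = mean(x), median = median(x), n = length(x))
)
# aggregate() returns the three statistics as one matrix column; flatten it.
by_year <- do.call(data.frame, by_year)
print(by_year)
```

Note that the formula interface of `aggregate()` drops rows with a missing CreditScore, so `n` counts only loans with a score.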

How do Prosper’s proprietary credit metrics compare with external credit scores?

Prosper’s dataset contains two different proprietary credit metrics: ProsperRating (AA, A, B, C, D, E, HR) and ProsperScore (1 through 11, with 11 being the “best”). It appears that most borrowers are assigned ProsperRating between A-D and ProsperScore between 4-8. (Note: the analysis below is based on loans originated after 1-Aug-2009 as these metrics were implemented only after July 2009.)

One naturally wonders what the difference between the two metrics is.

According to Prosper’s 10-K filings, ProsperScore is indicative of the probability of a loan going “bad”, that is, “more than 60 days past due within twelve months of the application date”. It is modeled on the historical behavior of Prosper’s borrower population, utilizing both credit report variables (e.g., inquiries in the last six months) and variables derived from member-provided information (e.g., debt-to-income ratio).

ProsperRating, on the other hand, is derived primarily from the ProsperScore as well as credit scores obtained from external agencies, with each letter grade corresponding to an estimated annualized loan loss rate range. Since ProsperRating is largely derived from ProsperScore, it should not be surprising to see a strong correlation between the two, as demonstrated in the box plot below.

Next, we are also interested in the relationship between these two proprietary credit metrics and the CreditScore from the external agency. Since all of these variables effectively measure the credit risk of the borrower, we would expect all three of them to be highly correlated. We have already shown that this is indeed the case for ProsperScore vs. ProsperRating.

The box plots below show that ProsperScore and ProsperRating also correlate well with CreditScore. Except for a couple of slight anomalies, a better ProsperRating (‘AA’ being the best) and a better ProsperScore (‘11’ being the best) generally correspond to a higher CreditScore. However, it is interesting to note that there are several buckets of ProsperScore (2-4, 5-7, 8-9) that have the same interquartile range of CreditScore. As further analysis, it would be interesting to know what differentiates the borrowers with higher ProsperScore from those with lower ones within each of these buckets.

Analysis | Part 2: Loans

What is the typical size and tenure of these loans? What are these loans used for? Can borrowers get these loans quickly and at attractive rates? Who actually funds these loans? In this section, we will explore all of these questions.

What is the general nature of these loans?

First I am curious about the tenure of these loans. From Prosper’s website, we know borrowers can only choose between 36-month and 60-month loans. I wonder if this held true in previous years. Plotting the distribution of Term across each year, we see that Prosper offered only 36-month loans at its inception, and then added 12-month and 60-month options in Q4 2010. However, Prosper dropped the 12-month option in Q2 2013 (presumably because it failed to gain traction with lenders and borrowers), leaving us with only two options today.

In this section we will restrict our analysis to loans originated after 2010, when borrowers had at least two Term options to choose from, and focus on shorter (36-month) vs. longer (60-month) term loans. We note that a third option (12-month) existed prior to Q2 2013, but since it was never popular, it does not affect our analysis much.

We see that 36-month is by far the more popular option, accounting for just under 65% of all loans. While there is significant demand for longer tenure loans (>30%), it appears that Prosper made the right decision in launching with 36-month loans, if it had to pick one single loan Term.

In terms of loan sizes, we see that the majority of loans are relatively small, with nearly 55% of them under $10,000 and the vast majority under $20,000. This is partly a function of Prosper’s policy of capping loans at $35,000, but it makes sense that Prosper does not accept overly large loan requests because lenders are likely to prefer diversifying their risks over many small loans rather than concentrating their risks in a large one.

In terms of ListingCategory, “Debt Consolidation” is by far the most popular reason for obtaining loans from Prosper, accounting for 50.8% of all loans. There are also significant numbers of borrowers seeking loans for business and home improvement purposes.

An interesting question is whether Term varies by ListingCategory. For instance, one might expect “business” and “auto” loans to be longer-term, whereas loans for “household expenses”, “vacation” and “taxes” would be used for bridging short-term cash flow gaps. However, the data did not support all of my intuition. Sure enough, “household expenses” and “vacation” were among the three categories with the lowest percentage of 60-month loans. However, “wedding” loans turned out to have the highest percentage of 60-month loans, and “auto” the lowest, which ran counter to my intuition. Overall, however, there does not appear to be much variation in term composition across the various ListingCategory values.
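
The term composition by category (the basis of a 100% stacked chart) can be sketched with row-wise proportions. The rows below are made up, and the category labels assume the numeric ListingCategory codes have already been mapped to names.

```r
# Made-up loans; Category stands in for a labelled ListingCategory.
loans <- data.frame(
  Category = c("Auto", "Auto", "Auto", "Wedding", "Wedding",
               "Vacation", "Vacation"),
  Term     = c(36, 36, 60, 60, 60, 36, 12),
  stringsAsFactors = FALSE
)

# Row-wise proportions: each category's loans split by term (rows sum to 1).
term_mix <- prop.table(table(loans$Category, loans$Term), margin = 1)
print(round(term_mix, 2))

# Categories ordered by their share of 60-month loans.
long_term_share <- sort(term_mix[, "60"], decreasing = TRUE)
print(long_term_share)
```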

Instead, loan size appears to be what drives the choice of loan term. On the left-hand side below, we see that 12-month loans are generally smaller than 36-month loans, which are in turn smaller than 60-month loans. On the right-hand side, we see the distribution of loan sizes by category, which may partially explain the counter-intuitive results we saw previously: “auto” loans have one of the smallest average loan sizes, whereas “wedding” loans have one of the largest.

Can borrowers obtain these loans quickly and at attractive rates?

We are interested in a metric that tells us how long it takes for a loan request to be funded. Let’s call this TimeToFund and define it as the number of days between ListingCreationDate and LoanOriginationDate, plotted below. Nearly 65% of requests were funded within ten days, which seems reasonably quick, though we should caveat that these are loans that ultimately got funded, and we have no information about listings that never did.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1       5       8      12      13    1095
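
TimeToFund as defined above can be computed from the two timestamp columns. This sketch uses made-up timestamps in the same format as the dataset and counts whole calendar days; whether the original analysis counted calendar days or fractional days is an assumption.

```r
# Made-up timestamps in the same string format as the dataset columns.
listing_created <- c("2013-05-01 09:30:00.000000000",
                     "2014-01-10 18:05:12.000000000")
loan_originated <- c("2013-05-09 00:00:00", "2014-01-15 00:00:00")

# Keep only the yyyy-mm-dd part, then subtract the dates.
created_date    <- as.Date(substr(listing_created, 1, 10))
originated_date <- as.Date(substr(loan_originated, 1, 10))
time_to_fund    <- as.numeric(originated_date - created_date)
print(time_to_fund)

# Share of loans funded within ten days.
funded_within_ten <- mean(time_to_fund <= 10)
```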

Moreover, the speed of funding has increased over time, with the median TimeToFund improving steadily from 10 days in 2011 to merely 5 days in 2014. For the qualified borrower, Prosper certainly seems to be a quick funding source.

The chart below shows the range of BorrowerRate for loans originated in 2014, broken down by ProsperRating. The least risky AA-rated borrowers enjoy fairly low interest rates of 7.5% on average. For each “rung” down the ProsperRating scale (AA to A, A to B), the borrower pays an additional 4-6 percentage points of interest on average, though the increase from D to E seems considerably larger. Indeed, borrowers with a ProsperRating of E or worse generally pay between 27% and 31% interest, which seems high considering the credit card APR for borrowers with poor credit typically ranges from 18% to 24%. Nevertheless, this is consistent with Prosper’s policy of keeping out the riskiest borrowers (those with credit scores below 640).

In the previous box plots, it was not surprising that each ProsperRating had a tight range of BorrowerRate. After all, ProsperRating represents the firm’s view of the borrower’s riskiness based on its proprietary algorithm, which in turn determines how much (interest) that borrower ought to be charged.

Now let’s take a look at the relationship between BorrowerRate, CreditScore, and Term below. As expected, longer-term loans command higher interest rates, and there is a negative correlation between BorrowerRate and CreditScore. However, there is considerable spread in BorrowerRate among those with low CreditScore. It is possible, albeit less likely, for a borrower with an average to good CreditScore to obtain financing at rates comparable to those with excellent scores (e.g. below 10%). This confirms that Prosper’s algorithm looks beyond CreditScore, and that folks with low scores but an otherwise strong credit profile could potentially still obtain cheap financing through its marketplace.

Who is funding these loans?

Prosper understands that lenders may be reluctant to put all their eggs in one basket and prefer to diversify their investments across many borrowers. Thus it allows loan requests to be funded by multiple lenders, requiring only a minimum investment of $25. Many investors take advantage of this option, with the median loan being funded by 45 investors. As seen in the bar chart below, there is also a very long tail, with a 99th percentile and maximum of 218 and 1189 investors respectively. On the other hand, nearly a quarter of the loans were each funded by only one investor. These might be investors who perform significant due diligence on each borrower and thus have greater confidence that the borrower will ultimately pay back.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    3.00   45.00   81.26  116.00 1189.00

## [1] 0.2368072
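
The second figure above is presumably the share of loans with exactly one investor. On a made-up Investors vector, the median and that share can be computed as follows.

```r
# Made-up Investors counts, one entry per loan.
investors <- c(1, 1, 3, 20, 45, 60, 116, 258)

median_investors <- median(investors)     # typical number of investors per loan
single_share <- mean(investors == 1)      # share of loans with one investor
print(c(median = median_investors, single = single_share))
```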

The next chart shows the relationship between loan amount, term of loan, and number of investors. Firstly, we note that the minimum loan amount allowed by Prosper is $2,000. Secondly, we also note that there is a theoretical ceiling on the number of investors for each loan amount due to Prosper’s $25 minimum investment rule. However, few loans, especially larger loans, come close to hitting this ceiling, as that would require every investor to want to invest only $25.
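
The ceiling is the loan amount divided by the $25 minimum. A short worked example for a few loan sizes between the $2,000 minimum and the $35,000 maximum mentioned in the text:

```r
min_investment <- 25                         # Prosper's minimum, per the text
loan_amount <- c(2000, 10000, 25000, 35000)  # illustrative loan sizes

# Largest possible number of investors if everyone invests the minimum.
max_investors <- floor(loan_amount / min_investment)
print(data.frame(loan_amount, max_investors))
```

For the largest allowed loan the ceiling works out to 1,400 investors, above the observed maximum of 1,189 reported earlier.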

One trend we might have expected to see is that longer-term loans need to be funded by more investors, because each individual investor might be more “skittish” about taking on a longer-term loan and thus inclined to invest fewer dollars per loan. However, our sample data does not support this hypothesis. The expected number of investors is effectively the same for loans below $25,000, though the two terms begin to diverge after that, with longer-term loans requiring more investors as expected. However, we should take this result with a grain of salt, given the relatively few data points we have for higher loan amounts.

Prosper also facilitates and tracks lending from investors who are friends of the borrower. Theoretically, some might find Prosper’s infrastructure helpful in formalizing loan agreements between friends, as an alternative to “handshake agreements” that might prove awkward down the road. However, this does not appear to have taken off, as fewer than 2% of loans historically were funded by at least one friend, and by 2014 the number had effectively dwindled to zero (only 3 out of 11,327 loans had investment from friends).

##    FriendsCount  2007  2008 2009 2010  2011  2012  2013  2014
## 1             0 10762 10710 1945 5426 11079 19479 33455 11324
## 2             1   606   701   87  202   137    69    30     3
## 3             2    68   100   10   18     9     4     6     0
## 4             3    12    19    3    4     1     1     0     0
## 5             4     2    11    1    1     0     0     0     0
## 6             5     2     4    0    1     1     0     0     0
## 7             6     1     1    1    0     0     0     1     0
## 8             7     0     2    0    0     0     0     0     0
## 9             8     3     0    0    0     0     0     0     0
## 10            9     2     2    0    0     1     0     0     0
## 11           13     0     1    0    0     0     0     0     0
## 12           15     1     0    0    0     0     0     0     0
## 13           20     1     0    0    0     0     0     0     0
## 14           33     0     1    0    0     0     0     0     0
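
A cross-tabulation like the one above can be built with `table()`, and the share of loans with at least one friend investor follows from it. The sketch below uses made-up rows (the `Year` column is a hypothetical helper derived from the origination date) and then checks the 2014 share using the counts in the table above.

```r
# Made-up loans; Year is a hypothetical helper column.
loans <- data.frame(
  Year = c(2007, 2007, 2007, 2008, 2008, 2014, 2014),
  InvestmentFromFriendsCount = c(0, 1, 2, 0, 1, 0, 0)
)

# Rows: number of friend investors; columns: origination year.
friends_by_year <- table(FriendsCount = loans$InvestmentFromFriendsCount,
                         Year = loans$Year)
print(friends_by_year)

# Share of loans in each year with at least one friend investing.
share_with_friends <- tapply(loans$InvestmentFromFriendsCount > 0,
                             loans$Year, mean)

# 2014 column of the table above: 3 loans with a friend, 11324 without.
share_2014 <- 3 / (11324 + 3)
print(round(share_2014, 5))
```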

Analysis | Part 3: Prosper’s Growth

In this final section we want to examine Prosper’s growth trends over the years since inception, both at the aggregate and state level.

How have loans originated grown over time on aggregate?

It appears that Prosper’s business went through a rough patch during the recession, with the number of loans and total loan amount plunging in 2009 and continuing to languish in 2010 at levels even lower than in the year of inception. In fact, there were rumors then that Prosper was hemorrhaging cash and might soon be out of business. Fortunately, Prosper has recovered since, with the number of loans and total loan amount originated growing roughly 2x and 2.5x per year respectively (a CAGR of about 100% and 150%) since the lows in 2009. Prosper originated over 33K loans amounting to $353M in 2013 alone.
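
A compound annual growth rate over n years is (end / start)^(1/n) - 1. The worked example below uses the yearly loan counts from the credit-score table earlier in the report (2,047 loans in 2009 and 33,492 in 2013); the counts behind the chart may differ slightly, and the dollar amounts cannot be checked the same way from the tables shown.

```r
# Yearly loan counts taken from the credit-score table earlier in the report.
loans_2009 <- 2047
loans_2013 <- 33492
years <- 2013 - 2009

# CAGR = (end / start)^(1 / years) - 1
cagr <- (loans_2013 / loans_2009)^(1 / years) - 1
print(round(cagr, 2))
```

On those counts the rate comes out at about 1.0, that is, loan counts roughly doubling each year.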

Summary

Key takeaway #1: Prosper borrowers are fairly credit worthy

Prosper borrowers’ median credit score has improved from 610 at its inception to 710 by 2009 and held steady since. Most recently in 2014, more than a quarter of borrowers had excellent credit scores at or above 740, and none had scores below 640.

Key takeaway #2: There is a lot more to determining borrower rates than just credit scores and loan terms

On average, borrower rates are indeed lower for those seeking shorter loan terms and for those with higher credit scores. However, there is considerable variance left unexplained. For instance, borrowers with a credit score of 650 may be charged anywhere from 10% to 31% interest for a 36-month loan.

Key takeaway #3: California has the largest number of users in absolute terms, but not on a population-adjusted basis

California ranks in the middle of the pack in terms of both penetration and growth rates, lagging considerably behind states such as Maryland, Connecticut, and Rhode Island. Overall, however, Prosper is seeing positive growth across all states where it is legally allowed to operate.

Reflection

I really enjoyed exploring this dataset. Not only was I able to answer questions I’ve always had about P2P lending (e.g. how it works, who borrows and lends through such platforms), I also managed to sharpen my data manipulation skills in R and better appreciate the power of ggplot2. I feel much more confident about tackling a new dataset, surveying it, and leveraging ggplot2 to create a variety of plots to identify and represent key statistical findings.

Of course, the project was not without its challenges. First of all, there were many variables in the dataset and it took a while for me to narrow down the scope of the analysis and determine the subset of variables I wanted to explore. Secondly, I got stuck on several plot features that I wanted to implement but that were not covered in the course instructions (e.g. choropleth maps, 100% stacked charts, adding median lines to facet wraps), and spent considerable time googling for the right solutions. I came to realize that knowing how best to phrase a coding question in Google and StackOverflow is a critical skill that data scientists should have. A third issue that might be unique to this dataset is that it was collected over a span of ten years. There were instances where analyzing the data in aggregate was appropriate, but others where it made more sense to restrict the analysis to only the recent data. In hindsight, I did so in a slightly haphazard manner and could have applied more structure to the analysis at the outset.

Of course, this study was meant as an exploratory data analysis and we are only beginning to scratch the surface of extracting valuable insights from the data. There are a number of additional analyses that I would love to explore in future work. Firstly, I would love to examine which other factors (besides term and credit score) may explain the ProsperRating and hence the BorrowerRate charged, and effectively back out Prosper’s proprietary risk model. Secondly, I would explore the performance of Prosper loans in terms of delinquencies and loan losses. Last but not least, if I had access to the full listing data, I would love to know what proportion of listings get funded, and what characteristics predict whether a listing will be funded or not.